LS4003 R worksheet 1

Penguins laying eggs

Introduction to the data

For this worksheet, you will need the penguins.csv file from the Canvas page.

The data come from a study of penguin couples. Once a penguin pair had laid an egg, both parents were captured and measured. The pair was then tracked to see whether they went on to lay a second egg, giving a “Full Clutch”.

This dataset contains the following information:

Penguins data columns and descriptions

Column          Description
SampleNumber    Unique number for every penguin
LatinName       Species name, full Latin name
CommonName      Common name for the penguin species
FullClutch      Whether or not the penguin couple laid two eggs
FlipperLength   Length of the penguin’s flipper in millimeters
BodyMass        Mass of the penguin in grams
Sex             Whether the penguin is male or female

The dataset covers penguins of the following three species:

Adelie Penguin

Chinstrap Penguin

Gentoo Penguin

The tasks

Task 1: Read the data

To start off, you will want to:

  1. Read the csv file into R as a dataframe
  2. Use summary to get an overview of the data
  3. Calculate the mean and standard deviation of the numerical columns
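If you are unsure where to begin, the three steps above can be sketched roughly as follows. So that it runs on its own, this sketch first writes a tiny stand-in file with made-up values; the way FullClutch and Sex are coded here (Yes/No, MALE/FEMALE) is an assumption and may not match the real penguins.csv. With the real file you would skip that part and call read.csv("penguins.csv") directly.

```r
# Toy stand-in for penguins.csv: every value below is made up for illustration.
toy_csv <- tempfile(fileext = ".csv")
writeLines(c(
  "SampleNumber,CommonName,FullClutch,FlipperLength,BodyMass,Sex",
  "1,Adelie Penguin,Yes,181,3750,MALE",
  "2,Adelie Penguin,No,186,3800,FEMALE",
  "3,Chinstrap Penguin,Yes,195,3650,FEMALE",
  "4,Gentoo Penguin,Yes,217,5200,MALE"
), toy_csv)

# 1. Read the csv file into R as a dataframe
#    (with the real data: penguin_df <- read.csv("penguins.csv"))
penguin_df <- read.csv(toy_csv)

# 2. Get an overview of every column
print(summary(penguin_df))

# 3. Mean and standard deviation of the numerical columns
#    na.rm = TRUE ignores missing values, in case the real data has any
flipper_mean <- mean(penguin_df$FlipperLength, na.rm = TRUE)
flipper_sd   <- sd(penguin_df$FlipperLength, na.rm = TRUE)
mass_mean    <- mean(penguin_df$BodyMass, na.rm = TRUE)
mass_sd      <- sd(penguin_df$BodyMass, na.rm = TRUE)

print(c(flipper_mean = flipper_mean, flipper_sd = flipper_sd))
print(c(mass_mean = mass_mean, mass_sd = mass_sd))
```

The dataframe name penguin_df matches the one used later in this worksheet; the other variable names are only suggestions.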

Task 2: Simple boxplots

Now that you have read the data into R, you should be able to use ggplot2 to draw boxplots.

Make boxplots of:

  1. FlipperLength, separated by species
  2. BodyMass, separated by species

Which value do you want on the x-axis? What about the y-axis?

To colour by species, try using the fill = option in aes().
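As a starting point, here is one possible sketch of a boxplot coloured by species. It assumes the ggplot2 package is installed and uses a small made-up dataframe in place of penguins.csv so that it runs on its own. Putting the species on the x-axis is one reasonable choice, not the only one.

```r
library(ggplot2)

# Made-up values standing in for the real penguins.csv data
penguin_df <- data.frame(
  CommonName = rep(c("Adelie Penguin", "Chinstrap Penguin", "Gentoo Penguin"),
                   each = 4),
  FlipperLength = c(181, 186, 190, 193,
                    192, 196, 198, 201,
                    211, 215, 217, 221)
)

# Category on the x-axis, measurement on the y-axis,
# and fill = inside aes() to colour each box by species
flipper_boxplot <- ggplot(penguin_df,
                          aes(x = CommonName, y = FlipperLength,
                              fill = CommonName)) +
  geom_boxplot()

print(flipper_boxplot)
```

Swapping FlipperLength for BodyMass in aes() gives the second plot once that column is available in your dataframe.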

A boxplot of FlipperLength separated by species should look like this:

Task 3: Split boxplots by multiple categories

In the last task we used the same column of data for both the x-axis and the fill colour. If we use a different column for one of these, we can split the data by a second category.

Make boxplots of:

  1. FlipperLength, separated by species and Sex
  2. FlipperLength, separated by species and FullClutch
  3. BodyMass, separated by species and Sex
  4. BodyMass, separated by species and FullClutch
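One way to approach the first of these is sketched below: species stays on the x-axis while fill = is switched to Sex. The dataframe is made up so the sketch runs on its own, and the MALE/FEMALE coding is an assumption about how Sex is recorded in the real file.

```r
library(ggplot2)

# Made-up values standing in for the real penguins.csv data
penguin_df <- data.frame(
  CommonName = rep(c("Adelie Penguin", "Chinstrap Penguin", "Gentoo Penguin"),
                   each = 4),
  Sex = rep(c("FEMALE", "FEMALE", "MALE", "MALE"), times = 3),
  FlipperLength = c(181, 186, 190, 193,
                    192, 196, 198, 201,
                    211, 215, 217, 221)
)

# x still separates the species; fill now splits each species by Sex,
# so ggplot2 draws one box per species and sex combination
flipper_by_sex <- ggplot(penguin_df,
                         aes(x = CommonName, y = FlipperLength, fill = Sex)) +
  geom_boxplot()

print(flipper_by_sex)
```

The other three plots follow the same pattern, changing only the y = and fill = columns.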

A boxplot of FlipperLength separated by species and Sex should look like this:

Task 4: Visualise distributions of FlipperLength and BodyMass

Next, use a histogram to visualise the distributions of sizes. You may want to use the fill = option to separate by one of the categories.

Make histograms of:

  1. FlipperLength, separated by species
  2. BodyMass, separated by species
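A histogram only needs an x value, because ggplot2 counts the observations in each bin for you. The sketch below uses made-up data again. The binwidth, position and alpha settings are illustrative choices rather than required values, so try changing them to see what happens.

```r
library(ggplot2)

# Made-up values standing in for the real penguins.csv data
penguin_df <- data.frame(
  CommonName = rep(c("Adelie Penguin", "Chinstrap Penguin", "Gentoo Penguin"),
                   each = 4),
  FlipperLength = c(181, 186, 190, 193,
                    192, 196, 198, 201,
                    211, 215, 217, 221)
)

# position = "identity" overlays the species instead of stacking them;
# alpha makes the bars see-through so overlapping bars stay visible
flipper_hist <- ggplot(penguin_df,
                       aes(x = FlipperLength, fill = CommonName)) +
  geom_histogram(binwidth = 5, position = "identity", alpha = 0.6)

print(flipper_hist)
```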

A histogram of FlipperLength separated by species should look like this:

Task 5: Use filter to plot results for individual species

  1. Using the filter() function, extract the values of just one species of penguin and save these in a new dataframe.
  2. Repeat the histograms from Task 4 using this new dataframe, but separate the distributions by Sex or FullClutch
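The two steps above could look something like this. The sketch assumes that filter() means the function from the dplyr package (base R has a different function with the same name, so dplyr must be loaded first). The dataframe is made up, and gentoo_df is just a suggested name.

```r
library(dplyr)
library(ggplot2)

# Made-up values standing in for the real penguins.csv data
penguin_df <- data.frame(
  CommonName = rep(c("Adelie Penguin", "Chinstrap Penguin", "Gentoo Penguin"),
                   each = 4),
  Sex = rep(c("FEMALE", "FEMALE", "MALE", "MALE"), times = 3),
  FlipperLength = c(181, 186, 190, 193,
                    192, 196, 198, 201,
                    211, 215, 217, 221)
)

# 1. Keep only the rows for one species and save them in a new dataframe
gentoo_df <- filter(penguin_df, CommonName == "Gentoo Penguin")

# 2. Histogram for that species alone, now split by Sex
gentoo_hist <- ggplot(gentoo_df, aes(x = FlipperLength, fill = Sex)) +
  geom_histogram(binwidth = 5, position = "identity", alpha = 0.6)

print(gentoo_hist)
```

Note the double equals sign in the filter condition: == tests whether two values are equal, while a single = would try to assign a value.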

A histogram of FlipperLength for a single species, separated by Sex, should look like this:

Extension: Larger penguin dataset

If you finish all of the above and would like to challenge your skills, download the Penguins_extension.csv file from the Canvas page.

This is the same dataset but with more columns:

Penguins extension data columns and descriptions

Column          Description
SampleNumber    Unique number for every penguin
LatinName       Species name, full Latin name
CommonName      Common name for the penguin species
Region          Region of Antarctica the penguin is in
Island          Island of Antarctica the penguin’s nest is on
FullClutch      Whether or not the penguin couple laid two eggs
EggDate         Date on which the first egg was laid
CulmenLength    Length of the penguin’s culmen in millimeters
CulmenDepth     Depth of the penguin’s culmen in millimeters
FlipperLength   Length of the penguin’s flipper in millimeters
BodyMass        Mass of the penguin in grams
Sex             Whether the penguin is male or female

One column of this dataset contains dates. By default R will treat these as strings - as plain text that is ordered alphabetically rather than chronologically.

This means that dates won’t be plotted in order from oldest to newest, which is not very useful.

To convert our dates into a “Date” class - to tell R that these are dates - we can use the following code:

penguin_df$EggDate <- as.Date(penguin_df$EggDate, format = "%d/%m/%y")
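To see why the conversion matters, here is a small sketch using three made-up dates written in the day/month/two-digit-year layout that the format string above expects. Whether the real EggDate column uses exactly this layout is an assumption, so check a few values in your own data first.

```r
# Three made-up dates, stored as strings in day/month/year order
egg_dates <- c("27/11/07", "03/12/07", "15/11/07")

# As strings they sort alphabetically, so December comes first
sort(egg_dates)
# "03/12/07" "15/11/07" "27/11/07"

# %d = day, %m = month, %y = two-digit year
parsed_dates <- as.Date(egg_dates, format = "%d/%m/%y")

class(parsed_dates)
# "Date"

# As dates they sort from oldest to newest
sort(parsed_dates)
# "2007-11-15" "2007-11-27" "2007-12-03"
```

If as.Date() returns NA values, the format string probably does not match how the dates are written in the file.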

At this point, it’s up to you what you want to do - explore the data! What can you find out? Are there any patterns emerging?

This is the end of the worksheet. Don’t worry if it still seems a bit alien - we have four more sessions after the winter break.

GIF of a baby penguin
